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Abstract 

Nonparametric mixture models based on the Dirich- 
let process are an elegant alternative to finite models 
when the number of underlying components is un- 
known, but inference in such models can be slow. Ex- 
isting attempts to parallelize inference in such models 
have relied on introducing approximations, which can 
lead to inaccuracies in the posterior estimate. In this 
paper, we describe auxiliary variable representations 
for the Dirichlet process and the hierarchical Dirichlet 
process that allow us to sample from the true poste- 
rior in a distributed manner. We show that our ap- 
proach allows scalable inference without the deterio- 
ration in estimate quality that accompanies existing 
methods. 



1 Introduction 



Models based on the Dirichlet process (DP, |Fergu- 



son 



1973[ ) and its extension the hierarchical Dirich 



let process (HDP, Teh et al. 2006) have a number 



of appealing properties. They allow a countably infi- 
nite number of mixture components a priori, meaning 
that a finite dataset will be modeled using a finite, 
but random, number of parameters. If we observe 
more data, the model can grow in a consistent man- 
ner. Unfortunately, while this means that such mod- 
els can theoretically cope with data sets of arbitrary 
size and latent dimensionality, in practice inference 
can be slow, and the memory requirements are high. 

Parallelization is a technique often used to speed 
up computation, by splitting the computational and 
memory requirements of an algorithm onto multiple 
machines. Parallelization of an algorithm involves 



exploitation of (conditional) independencies. If we 
can update one part of a model independently of an- 
other part, we can split the corresponding sections 
of code onto separate processors or cores. Unfortu- 
nately, many models do not have appropriate inde- 
pendence structure, making parallelization difficult. 
For example, in the Chinese restaurant process rep- 
resentation of a Dirichlet process mixture model, the 
conditional distribution over the cluster allocation of 
a single data point depends on the allocations of all 
the other data points. 

In such cases, a typical approach is to apply ap- 
proximations that break some of the long-range de- 
pendencies. While this can allow us to parallelize 
inference in the approximate model, the posterior es- 
timate will, in general, be less accurate. Another 
option is to use a sequential Monte Carlo approach, 
where the posterior is approximated with a swarm of 
independent particles. In its simplest form, this ap- 
proach is inherently parallelizable, but such a naive 
implementation will run into problems of variance 
overestimation. We can improve the estimate quality 
by introducing global steps such as particle resam- 
pling and Gibbs steps, but these steps cannot easily 
be parallelized. 

In this paper, we show how the introduction of aux- 
iliary variables into the DP and HDP can create the 
conditional independence structure required to ob- 
tain a parallel Gibbs sampler, without introducing 
approximations. As a result, we can make use of the 
powerful and elegant representations provided by the 
DP and HDP, while maintaining manageable com- 
putational requirements. We show that the result- 
ing samplers are able to achieve significant speed-ups 
over existing inference schemes for the "exact" mod- 



els, with no loss in quality. By performing inference 
in the true model, we are able to achieve better re- 
sults than those obtained using approximate models. 

2 Background 

The Dirichlet process is a distribution over discrete 
probability measures D — J2T=i n k^9k with count- 
ably infinite support, where the finite-dimensional 
marginals are distributed according to a finite Dirich- 
let distribution. It is parametrized by a base proba- 
bility measure H , which determines the distribution 
of the atom locations, and a concentration parameter 
a > 0, which acts like an inverse variance. The DP 
can be used as the distribution over mixing measures 
in a nonparametric mixture model. In the Dirich- 



let process mixture model (DPMM, Antoniak 1974), 
data are assumed to be generated according 

to 



D ~ BP(a,H) 



D . 



(1) 



While the DP allows an infinite number of clusters 
a priori, any finite dataset will be modeled using a 
finite, but random, number of clusters. 

Hierarchical Dirichlet processes extend the DP to 
model grouped data. The HDP is a distribution over 
probability distributions D m , m — 1, . . . , M, each of 
which is marginally distributed according to a DP. 
These distributions are coupled using a discrete com- 
mon base-measure, itself distributed according to a 
DP. Each distribution D m can be used to model a 
collection of observations x r 



Do 



DP(a,ff) 



D, : 



^DP( 7 ,A)) 

~ f(6mi) 



(2) 



for m — 1, . . . , M and i — 1, . . . , N m . HDPs have 
been used to model data including text corpora [Teh| 
et al. ( 2006 ) , images Sudderth et al. ( 2005 ) , time se- 



ries data 



Fox et al. ( 2008 1 , and genetic variation Sohn 



k Xing| ( |2009 ). 

A number of inference schemes have been devel- 
oped for the DP and the HDP. The most commonly 
used class of inference methods is based on the Chi- 



nese restaurant process (CRP, see for example Al- 



measures (D in Eq. [T] D and {D m }^ =1 in Eq. [5} 
to obtain the conditional distribution for the cluster 
allocation of a single data point, conditioned on the 
allocations of all the other data points. A variety 
of Gibbs samplers based on the CRP can be con- 



structed for the DP Neal ( 1998 1 and the HDP Teh 



et al. (2006). Unfortunately, because the conditional 



distribution for the cluster allocation of a single data 
point depends on all the data, this step cannot be 
parallelized without introducing approximations. 

An alternative class of inference schemes involve 
explicitly instantiating the random measures Ish- 
waran k James (2001). In this case, the observa- 



tions are i.i.d. given the random measures, and can 
be sampled in parallel. However, since the random 
measure depends on the cluster allocations of all the 
data points, sampling the random measure cannot be 
parallelized. 

Among existing practical schemes for parallclizable 
inference in DP and HDP, the following three are the 
most popular: 



2.1 Sequential Monte Carlo methods 

Sequential Monte Carlo (SMC) methods approximate 
a distribution of interest using a swarm of weighted, 
sequentially updated, particles. SMC methods have 
been used to perform inference in a number of mod- 



els based on the DP Fearnhead (2004); 



(20101; Rodriguez (2011); Ahmed et al. (2011). Such 



Ulker et al. 



dous ( 1985 )). Such schemes integrate out the random 



methods are appealing when considering paralleliza- 
tion, because each particle (and its weight) are up- 
dated independently of the other particles, and need 
consider only one data point at a time. However, a 
naive implementation where each particle is propa- 
gated in isolation leads to very high variance in the 
resulting estimate. To avoid this, it is typical to in- 
troduce resampling steps, where the current swarm 
of particles is replaced by a new swarm sampled from 
the current posterior estimate. This avoids an explo- 
sion in the variance of our estimate, but the resam- 
pling cannot be performed in parallel. 



2 



2.2 Variational inference 



Variational Bayesian inference algorithms have been 



developed for both the DP Blei & Jordan (2004) 



Kurihara et al. (20071 and the HDP Teh ct al. (2007); 



Wang et al. (2011). Variational methods approxi- 



mate a posterior distribution p(9\X) with a distri- 
bution q(6) belonging to a more manageable family 
of distributions and try to find the "best" member 
of this family, typically by minimizing the Kullback- 
Leibler divergence between p(9\X) and q{0). This 
also gives us a lower bound on the log likelihood, 
logp(X). A typical approach to selecting the family 
of approximating distributions is to assume indepen- 
dencies that may not be present in the true posterior. 
This means that variational algorithms are often easy 
to parallelize. However, by searching only within a 
restricted class of models we lose some of the expres- 
siveness of the model, and will typically obtain less 
accurate results than MCMC methods that sample 
from the true posterior. 



2.3 Approximate parallel Gibbs sam- 
pling 

An approximate distributed Gibbs sampler for the 



HDP is described by Asuncion et al. (20081. The ba- 



sic sampler alternates distributed Gibbs steps with 
a global synchronization step. If we have P proces- 
sors, in the distributed Gibbs steps, each processor 
updates 1/P of the topic allocations. To sample the 
topic allocation for a given token, rather than use 
the current topic allocations of all the other tokens, 
we use the current topic allocations for the tokens on 
the same processor, and for all other tokens we use 
the allocations after the last synchronization step. In 
the synchronization step, we amalgamate the topic 
counts. This can run into problems with topic align- 
ment. In particular, there is no clear way to decide 
whether to merge a new topic on processor p with 
a new topic on processor p' . An asynchronous ver- 
sion of the algorithm avoids the bottleneck of a global 
synchronization step. 



3 Introducing auxiliary vari- 
ables to obtain conditional 
independence 

The key to developing a parallel inference algorithm 
is to exploit or introduce independencies. In the se- 
quential Monte Carlo samplers, this independence 
can lead to high variance in the resulting estimate. 
In the other algorithms described, the independence 
is only achieved by introducing approximations. 

If observations are modeled using a mixture model, 
then conditioned on the cluster allocations the ob- 
servations are independent. The key idea that al- 
lows us to introduce conditional independence is the 
fact that, for appropriate parameter settings, Dirich- 
let mixtures of Dirichlet processes are Dirichlet pro- 
cesses. In Theorems [3] and [4j we demonstrate sit- 
uations where this result holds, and develop mix- 
tures of nonparametric models with the appropriate 
marginal distributions and conditional independence 
structures. The resulting models exhibit conditional 
independence between parameters that are coupled 
in Eq. [T] and Eq. [2] allowing us to perform parallel 
inference in Section [4] without resorting to approxi- 
mations. 

Theorem 1 (Auxiliary variable representation for 
the DPMM) . We can re-write the generative process 
for a DPMM (given in Eq. [7p as 



Da ~ DP - 



(?■*) 



Dirichlet ( — 



/ a ( 
VP''"'] 
Xi ~ f(0 t ), 



(3) 



for j — 1, . . . , P and i = 1, . . . , N . The marginal 
distribution over the Xi remains the same. 

Proof. We prove a more general result, that if <j> ~ 
Dirichlet (pt\, ttp) and Dj ~ DP(aj, Hj), then 



dp(£ 7 



-)• 



A Dirichlet process with concentration parameter 
a and base probability measure H can be obtained 
by normalizing a gamma process with base measure 
aH. Gamma processes are examples of completely 



random measures Kingman (1967), and it is well 
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known that the superposition of P completely ran- 
dom measures is again a completely random measure. 
In particular, if Gj ~ G&P(ajHj), j = 1, . . . , P, then 
G:=E,G,~ GaPE jQj H 3 ). 

Note that the total masses of the Gj are gamma- 
distributed, and therefore the vector of normalized 
masses is Dirichlet-distributed. It follows that, if 
Dirichlet (ai, ...,ap) and Dj ~ DP(aj, Hj), then 



D 



E 7 <^ 



dp(E, 



Q: , 



E 3 "j 



). This result, 



which is well known in the nonparametric Bayes com- 
munity, is explored in more depth in Chapter 3 of 
Ghosh & Ramamoorthi (2003). An alternative proof 
is given in Appendix [A} □ 

The auxiliary variables 7Tj introduced in Eq. [7] in- 
troduce conditional independence, which we exploit 
to develop a distributed inference scheme. If we have 
P processors, then we can split our data among them 
according to the 7Tj. Conditioned on their processor 
allocations, the data points are distributed according 
to independent Dirichlet processes. In Section [4j we 
will see how we can combine local inference in these 
independent Dirichlet processes with global steps to 
move data between processors. 

We can follow a similar approach with the HDP, 
assigning each token in each data point to one of P 
groups corresponding to P processors. However, to 
ensure the higher level DP can be split in a man- 
ner consistent with the lower level DP, we must 
impose some additional structure into the genera- 
tive process described by Eq. [2] - specifically, that 
7 ~ Gamma(a). We can then introduce auxiliary 
variables as follows: 

Theorem 2 (Auxiliary variable representation for 
the HDP). If we incorporate the requirement that the 
concentration parameter 7 for the bottom level DPs 
{Dj}jL 1 depends on the concentration parameter a 
for the top level DP D as 7 ~ Gamma(a), then we 
can rewrite the generative process for the HDP as: 



Qj ~ Gamma(a/P) 
D oj ~ DP(a/P, H) 



v m — Dirichlet((i, . . . , £p) 



D 



"i j 



DP(Q,D, 



J f{6mi) 



(4) 



for j = 1, . . . , P , m = 1, . . . , M, and i = 1, . . . , N m . 
The marginal distribution over the x m i is the same 
in Eq.s^ and\i!\ 

Proof. If (j ~ Gamma(a/P), then 7 := YljCj ~ 
Gamma(a). Since Doj ~ DP (a/P, H) 1 it follows that 
G03 : = GAy ~ GaP(fff) and G := G 0j ~ 
GaP (aH). If we normalize Go, we find that Dq := 
J2j ^Ay ~ BP(a,H), as required by the HDP. 

If we write G m j ~ GaP(Goj) = GaP(QD j), then 
we can see that G m j = r\mjD m j, where T] m j ~ 
Gamma(Cj), and D m j ~ DP (( j, D j). 

If we normalize the G mj -, we find that 



D,. 



E 



Vtnj 



Efe Vmk 



D 



\)\-(\\ ^ CjDoj 



= DP( 7 ,Do), 



as required by the HDP. Since the rj m j only ap- 
pear as a normalized vector, we can write v m — 

~Dirichlet(Ci,...,Cp)- 
A more in-depth version of this proof is given in 
Appendix [A] □ 

Again, the application to distributed inference is 
clear: Conditioned on the ?r m j we can split our data 
into P independent HDPs. 



4 Inference 

The auxiliary variable representation for the DP in- 
troduced in Theorem [3] makes the cluster allocations 
for data points where 7Tj = j conditionally indepen- 
dent of the cluster allocations for data points where 
Hi j. We can therefore split the data onto P paral- 
lel processors or cores, based on the values of 7^. We 
will henceforth call 7Tj the "processor indicator" for 
the ith data point. We can Gibbs sample the clus- 
ter allocations on each processor independently, in- 
termittently moving data between clusters to ensure 
mixing of the sampler. 
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4.1 Parallel inference in the Dirichlet 
process 

We consider first the Dirichlet process. Under the 
construction described in Eq. [7J each data point Xi is 
associated with a processor indicator 7Tj and a cluster 
indicator 2j. All data points associated with a single 
cluster will have the same processor indicator, mean- 
ing that we can assign each cluster to one of the P 
processors (i.e., all data points in a single cluster are 
assigned to the same processor). Note that the jth 
processor will typically be associated with multiple 
clusters, corresponding to the local Dirichlet process 
Dj. Conditioned on the assignments of the auxiliary 
variables 7Tj , the data points cc, in Eq. [7] depend only 
on the local Dirichlet process Dj and the associated 
parameters. 

We can easily marginalize out the Dj and <j>. As- 
sume that each data point Xi is assigned to a pro- 
cessor iti € { 1 , . . . , P} , and a cluster z\ residing on 
that processor (i.e., all data points in a single cluster 
are assigned to the same processor) . We will perform 
local inference on the cluster assignments Zi, and in- 
termittently we will perform global inference on the 



4.1.1 Local inference: Sampling the Zi 

Conditioned on the processor assignments, sampling 
the cluster assignments proceeds exactly as in a nor- 
mal Dirichlet process with concentration parameter 
a/ P. A number of approaches for Gibbs sampling in 



the DPMM are described by Neal ( 1998 ) 



4.1.2 Global inference: Sampling the 7Tj 

Under the auxiliary variable scheme, each cluster is 
associated with a single processor. We jointly resam- 
ple the processor allocations of all data points within 
a given cluster, allowing us to move an entire cluster 
from one processor to another. We use a Metropolis 
Hastings step with a proposal distribution indepen- 
dently assigns cluster k to processor j with proba- 
bility 1/P. This means our accept/reject probability 
depends only on the ratio of the likelihoods of the 
two assignments. 



The likelihood ratio is given by: 

p(K|) ^(I^IKMKIK^) 

P({*i}) P{{xi}Wi)p({n}\a,P) 

= P ({zJKMK*}| a ,p) 
KMkMWkP) (5) 

p max(jVj,A r *) 

ii ii :'r 

1=1 z=i y 



where Nj is the number of data points on processor j, 
and a>ij is the number of clusters of size i on processor 
j. In fact, we can simplify Eq. [13] further, since many 
of the terms in the ratio of factorials will cancel. A 



derivation of Eq. 13 is given in Appendix |B.1| 

The reassignment of clusters can be implemented 
in a number of different manners. Actually trans- 
ferring data from one processor to another will lead 
to bottlenecks, but may be appropriate if the entire 
data set is too large to be stored in memory on a 
single machine. If we can store a copy of the dataset 
on each machine, or we are using multiple cores on 
a single machine, we can simply transfer updates to 
lists of which data points belong to which cluster on 
which machine. We note that the reassignments need 
not occur at the same time, reducing the bandwidth 
required. 

4.2 Parallel inference in the HDP 

Again, we can assign tokens x m i to one of P proces- 
sors according to ir m i. Conditioned on the processor 
assignment and the values of Q , the data on each pro- 
cessor is distributed according to a HDP. We instan- 
tiate the processor allocations Tr mi and the bottom- 
level DP parameters, plus sufficient representation to 
perform inference in the processor-specific HDPs. 

4.2.1 Local inference: Sampling the table 
and dish allocations 

Conditioned on the processor assignments, we simply 
have P independent HDPs, and can use any existing 
inference algorithm for the HDP. In our experiments, 
we used the Chinese restaurant franchise sampling 



scheme Teh et al. (2006). 
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4.2.2 Global inference: 
the Q 



Sampling the n m j and 



We 



can represent the Q as Q :— 7^, where 
7 ~ Gamma(a, 1) and £ := (£1, ■■■,&) ~ 
Dirichlet (a/P, . . . , a/P). We sample the 7r mj - and the 
£j jointly, and then sample 7, in order to improve the 
acceptance ratio of our Metropolis-Hastings steps. 

Again, we want to reallocate whole clusters rather 
than independently reallocate individual tokens. So, 
our proposal distribution again assigns cluster k to 
processor j with probability 1/P. Note that this 
means that a single data point does not necessar- 
ily reside on a single processor - its tokens may be 
split among multiple processors. We also propose 
~ Dirichlet (a/P, . . . , a/P), and accept the result- 
ing state with probability min(l, r), where 



both the speed of the algorithms and the quality of 
the approximations obtained. 

5.1 Dirichlet process mixture of Gaus- 
sians 

We begin by evaluating the performance of the 



Dirichlet process sampler described in Section 4.1 



We generated a synthetic data set of one million data 
points from a mixture of univariate Gaussians. We 
used 50 components, each with mean distributed ac- 
cording to Uniform(0, 10) and fixed variance of 0.01. 
A synthetic data set was chosen because it allows us 
to compare performance with "ground truth" . We 
compared four inference algorithms: 



P{{ x mi}\{-K* mi }, 1, C,a, P) p{{K* mi \\l, QpjC \a, P) 
p({Xmi}\{Tr m i},J,£,a,P) p({7r roi }|7,^ p(€\a,P) 



p { C ] ) T ^ a/P T*\ T(a/P + T. J ) , ^r b Tl \ 

11 (tAT.i+a/P TJTlnlP 4-T*.\ 11 



M 

n 



U fojW T . 3 \ T(a/P + T*) 11 11 a* ml ! 

(6) 

A derivation of Eq. [6] is given in Appendix |B.2| As 
before, many of the ratios can be simplified further, 
reducing computational costs. 

As with the Dirichlet process sampler, we can ei- 
ther transfer the data between machines, or simply 
update lists of which data points are "active" on each 
machine. We can resample 7 after sampling the £j 
and Tr m j using a standard Metropolis Hastings step 



Auxiliary variable parallel Gibbs sampler 
(AVparallel) - the model proposed in this pa- 
per, implemented in Java. 

Sequential Monte Carlo (SMC) - a basic 
sequential importance resampling algorithm, im- 
plemented in Java. 

Variational Bayes (VB) - the collapsed varia- 
tional Bayesian algorithm described in |Kurih ara 
et al. ( 2007 ) . We used an existing Matlab imple- 
ment atioij 

Asynchronous parallel DP (Asynch) - we 

modified the Asynchronous sampler for the HDP 



Asuncion ct al. (2008) to be applicable to the DP. 



We implemented the sampler in Java, using the 



settings described in Asuncion et al. (2008 1 . 



(described in Appendix B.3 ) 



5 Experimental evaluation 

Our goal in this paper is to employ parallelization to 
speed up inference in the DP and HDP, without intro- 
ducing approximations that might compromise the 
quality of our model estimate. To establish whether 
we have achieved that goal, we perform inference in 
both the DP and HDP using a number of inference 
methods, and look at appropriate measures of model 
quality as a function of inference time. This captures 



In each case, we ran the algorithms on one, two, 
four and eight processors on a single multi-core ma- 
chine]^ with one processor corresponding to the reg- 
ular Gibbs sampler. We initialized each algorithm 
by clustering the data into 80 clusters using K-means 
clustering, and split the resulting clusters among the 
processors. 

We consider the Fl score between the clusterings 
obtained by each algorithm and the ground truth, as 



Code obtained from https://sites.google.com/site/ 
kenichikurihara / academic-software / variational-dirichlct- 
process-gaussian-mixture-model 

2 An AMD FX 8150 3.6 GHz (8 core) with 16 gig ram 
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(a) Fl score against run time for AVparallel. 



(b) Fl score against run time for various algorithms. 
Unless otherwise specified, eight processors are used. 



Local Gibbs Step 
—Global MCSTEP 




2 4 
Number of processors 



4 6 
Number of processors 



(c) Time taken to reach convergence (< 0.1% change in 
Fl). 



(d) Time spent in global and local steps for AVparallel, 
over 500 iterations. 



Figure 1: Synthetic data modeled using a DPMM. 



a function of time. Let be the set of pairs of ob- 
servations that are in the same cluster under ground 
truth, and let 7> (m) be the set of pairs of observations 
that are in the same cluster in the inferred model. 
Then we define the Fl score of a model as the har- 
monic mean between the precision - the proportion 
|7?(9) n -p( m )|/|7'( m) | of pairs that are co-clustered by 
the model that also co-occur in the true partition - 
and the recall - the proportion \T> {9) n V (m) \/\V {9) \ 



of true co-occurrences that are co-clustered by the 
model. This definition of Fl is invariant to permuta- 
tion, and so is appropriate in an unsupervised setting 



Xing et al. (2002) 



Figure l(a)| shows the Fl scores for our auxiliary 
variable method over time, using one, two, four and 
eight processors. As we can see, increasing the num- 
ber of processors increases convergence speed with- 
out decreasing performance. Figure 1(b) shows the 
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Fl scores over time for the four methods, each us- 
ing eight cores. While we can get very fast results 
using variational Bayesian inference, the quality of 
the estimate is poor. Conversely, we achieve better 
performance (as measured by Fl score) than com- 
peting MCMC algorithms, with faster convergence. 



Figure 1(c) shows the time taken by each algorithm 



to reach convergence, for varying numbers of proces- 
sors. AV parallel and Asynch perform similarly. Fig- 
ure 1(d) shows the relative time spent sampling the 



processor allocations (the global step) and sampling 
the cluster allocations (the local step), over 500 it- 
erations. This explains the similar scalability of AV- 
parallel and Asynch: In AVparallel the majority of 
time is spent on local Gibbs sampling, which is im- 
plemented identically in both models. 

5.2 HDP topic model 

Next, we evaluate the performance of the HDP sam- 
pler on a topic modeling task as described by Teh 
et al. (12006). Our dataset was a corpus of NIPS 
papers^] consisting of 2470 documents, containing 
14300 unique words and around 3 million total words. 
We split the dataset into a training set of 2220 docu- 
ments and a test set of 250 documents, and evaluated 
performance in terms of test set perplexity. We com- 
pared three inference methods: 

• Auxiliary variable parallel Gibbs sampler 
(AVparallel) - the model proposed in this pa- 
per, implemented in Java. 

• Variational Bayes (VB) - the collapsed varia- 
tional Bayesian algorithm described in Teh et al. 
(2007). We used an existing Java implementa- 

• Asynchronous parallel HDP (Asynch) 

the asynchronous sampler for the HDP |Asun-| 
cion et al. ( 2008 ) . We implemented the sampler 



in Java, using the settings described in the orig- 
inal paper. 



3 http:/ /ai. stanford.edu/ gal/data. html 

4 Code obtained from http://www.bradblock.com/tm- 
0.1. tar. gz 



Again, we ran each method on one, two, four and 
eight processors, and initialized each document to one 
of 80 clusters using K-means. 



Figure 2(a) shows the perplexity obtained using 



our auxiliary variable method over time, using one, 
two, four and eight processors. Figure 2(b) compares 



the perplexities obtained using each of the three infer- 
ence method over time, using eight processors. Fig- 
ure 2(c) shows the time to convergence of the three 



models for various numbers of processors. 

As with the DPMM, while the variational approach 
is able to obtain results very quickly, the quality is 
much lower than that obtained using MCMC meth- 
ods. The AVparallel method achieves much better 
perplexity than the approximate Asynch method - 
the difference is much more striking than that seen 
in the DPMM. Note that, in the synthetic data used 
for the DPMM model, the true clusters are of sim- 
ilar size, while in the real-world data used for the 
HDP experiment there are likely to be many small 
clusters. We hypothesize that while the errors intro- 
duced in the asynchronous approximate method have 
little effect if the clusters are large, they become more 
significant if we have small clusters. Again, we find 



(Figure 2(d) ) that the majority of time is spent in the 



local Gibbs sampler, meaning we can obtain a good 
rate of increase of speed by increasing the number of 
processors. 

6 Discussion and future work 

We have shown how alternative formulations for the 
DP and HDP can yield parallelizable Gibbs samplers 
that allow scalable and accurate inference. Our ex- 
periments show that the resulting algorithms offer 
better performance and scalability than existing par- 
allel inference methods. 

The scalability of our algorithms is limited by the 
size of the largest cluster, since each cluster must 
reside on a single processor. An interesting avenue 
for future development is to investigate approximate 
methods for splitting large clusters onto multiple pro- 
cessors. 

We have implemented and evaluated these meth- 
ods on multi-core machines; our next goals are to de- 
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(b) Test set perplexity against run time for various al- 
gorithms. Unless otherwise specified, eight processors 
are used. 
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Figure 2: NIPS corpus, modeled using an HDP. 
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A Theorem expanded proofs 

Theorem 3 (Auxiliary variable representation for 
the DPMM). We can re-write the generative process 
for a DPMM as 

Dj -DP^H), 4> ~ Dirichlet (£,..., |) , 

7Tj~<^, 6,^D^ z , Xi~ f{6i), 

(7) 

for j = 1, . . . , P and i = 1, . . . , N. The marginal 
distribution over the Xi remains the same. 

Proof. In the main paper, we proved the gen- 
eral result, that if <j> ~ Dirichlet(o!i, ctp) and 
Dj ~ DP(a j ,i? 7 ), then D := Y.j^jDj ~ 
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DP (£V dj, > ^'-" ii // ). This result has been used by can rewrite the generative process for the HDP as: 



authors including Rao fc Teh (2009) 



Here, we provide an explicit proof that shows the 
resulting predictive distribution is that of the Dirich- 
let process. 

Let 9i , 02, ■ ■ ■ be a sequence of random variable dis- 
tributed according to G ~ DP(a, Go). Then the con- 
ditional distribution of 9 n+ \ given 9%, . . . ,9 n where G 
has been integrated is given by 



n+l\9l, . ■ ■ , U n 



n , 

E^^ + ^ G ° ( 8 ) 

' — ' n + a n + a 



1=1 



Q ~ Gamma(a/P) n mi ~v m 

Ay ~ DP(a/P, H) 9 mi ~ D m7Tmi 

v m ~ Dirichlet((i, . . . , ( P ) x mi ~ f(9 mi ) 

D mj -,~DP(Cj,D 0j ), 

(10) 

for j = 1, . . . , P, m = 1, . . . , M, and i = 1, . . . , iV m . 

Proof. Let Cj ~ Gamma(a/F) and £>oj ~ 
DP(a/P,H), j = 1,...,P. This implies that 
Go, := CjD 0j ~ GaP((a/P)P) and 7 := £)J =1 ~ 
Gamma(a). 

By superposition of gamma processes, 



If Dj ~ DP(a/P, Go), - Dir(§ , . . . , f ), tr ~ ^ 
and 6*i ~ D ( p i then the conditional distribution of 
n+ i given 9i,...,9 n where Dj,Vj and </> has been 
integrated is given by 



9n+l|#l, • . • ,9 n 

p 

P(6 , n+lKri+l = j, 7Tl, . . . , 7T„ , 6*1, . . . 6> n , G ) 

rij + a/P 



E 



n + a 



a/P 



E , rn^0i^iri=j H r 77)^0 

^ nj + a/P rij + a/P 



n 

=E — * 



a 



1 = 1 



n + a n + a 



Go 



(9) 

□ 



G := E G 0j ~ GaP(aP) 

3=1 

p 

:=E0Ay 

3=1 



Normalizing Go, we get 



Go(-) ^ C. 



7 —7 



E^0i~ DP(a,P) 



as required by the HDP. 

Now, for m = 1, . . . , M and j = 1, . . . , P, let 7y m:) 
T(Q) and P m j ~ DP(£j, Dqj). This implies that 

G m j := T] m jD m j ~ GaP(CjPoj) 
:= GaP(G 0j ). 

Superposition of the gamma processes gives 

p 

G m '■= E^ ~ GaP(E^ G 0j) 
i j=i 

:= GaP(Go) = GaP( 7 P )- 



Theorem 4 (Auxiliary variable representation for 
the HDP). If we incorporate the requirement that the 
concentration parameter 7 for the bottom level DPs 
{Dj}y =1 depends on the concentration parameter a 
for the top level DP Dq as 7 ~ Gamma(a) , then we 



The total mass of G m is given by J2j=i Vmj, so 

An(0= ~DP( 7 ,A)) (11) 

as required by the HDP. 
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If we let v m j — T] m j I sum^ =1 r] m k, then we can 
rewrite Equation |11| as 

p 

AnQ = u rnjD mj ~ DP( 7 , D ), (12) 
where . . . , v mP ) ~ Dirichlet(Ci, • • ■ , Cp)- □ 



B Metropolis Hastings accep- 
tance probabilities 

In both cases, the proposal probabilities ( { 7r i } — > 
{-7T*}) = g({7r*} — > {iTi}), so we need only consider 
the likelihood ratios. 

B.l Dirichlet process 

In the Dirichlet process case, the likelihood ratio is 
given by: 

P(K» p({x t }\n*)p({n*}\a,P) 



p({ 7r J) p({xi}\n)p({n}\a, P) 

= p{{z i }\K)p({<)W,P) 
P({zi}\^)p{{TTi}\a, P) 

p mnx(Nj ,N*) 

=n n 

j=i <=i 



(13) 



7*.!' 



where Nj is the number of data points on processor j, 
and dij is the number of clusters of size i on processor 
j- 

The probability of the processor allocations is de- 
scribed by the Dirichlet compound multinomial, or 
multivariate Polya, distribution, 

p[{7Ti\ \a,ir) = — p p 

UUN,\T(N + j:f =1 a/P) 

^ T(Nj + a/P) 

j£ n^/p) 

N\ T(a) T(Nj + a/P) 

~nU N i lT{N+a) 7=i T{a/P) ' 



where N = J2j=i Nj is the total number of data 
points. So, 



p({TT*}\a,TT) 



g TV, r(7v; + a/p) 



p({n)\<*,v) *- L N*r(N 3 +a/P) 

j— 1 J 

Conditioned on the processor indicators, the prob- 
ability of the data can be written 

p 

P({zi}\Wi}) = lip({nj k }\Nj), 
j=i 

where rijk is the number of data points in the fcth 
on processor j. The distribution over cluster sizes in 
the Chinese restaurant process is described by Ewen's 
sampling formula, which gives: 



Nj\ 



N, 



T{a/P) -pj- 1 
P ) n«Unj k lWj + a/P)l[ aj l 



where Kj is the total number of clusters on processor 
j. Therefore, 



P({n* jk }\Nj) ^N*\ T(Nj + a/P) 



max(Nj,N*) 



p({njk}\Nj) 



Nj\ T{N* + a/P) 



n 



a* 1 ' 



so we get Equation [13] 



B.2 Hierarchical Dirichlet processes 

For the HDP, the likelihood ratio is given by 

P{{Xmi}\{^* m i 7, t,a, P) P(Utnj}\l, C) P(C\<*, P ) 

p{{x ml }\{7T ml -f,^,a,P) p({7T mi }|7,^ P(£\<*,P) 

The first term in the likelihood ratio is the ra- 
tio of the joint probabilities of the topic- and table- 
allocations in the local HDPs. This can be obtained 
by applying the Ewen's sampling formula to both top- 
and bottom-level DPs. Let tj be the count vector for 
the top-level DP on processor j - in Chinese restau- 
rant franchise terms, tjd is the number of tables on 
processor j serving dish d. Let rij m be the count 
vector for the mth bottom-level DP on processor j 
- in Chinese restaurant franchise terms, rij m k is the 
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number of customers in the mth restaurant sat at the 
kth table of the jth processor. Let T m j be the total 
number of occupied tables from the mth restaurant 
on processor j, and let Dj be the total number of 
unique dishes on processor j. Let Oj mi be the total 
number of tables in restaurant m on processor j with 
exactly i customers, and bji be the total number of 
dishes on processor j served at exactly i tables. We 



so the second term is given by 



n 

M 

n 



i vr( 7 c;) 



(15) 
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jm- 



II] 



use the notation rij 
etc. 

p{{n rm k}\lA) 

M P 

= n n^) 

rn — 1 j — 1 

and 

P({tjd}\a,P) 

(a\ D > ' T(a/P) |4 1 



m=l J m 

The third term is given by 

p 



! r( 7 6 + n 



p(C\a,P) 



n 

3=1 



(16) 



UT=1 "imfe! + ^1 <W 



Combining Equations 14 15 and 16 gives an ac- 
ceptance probability of min(l,r), where 



P({r jd h{n* jmk }\rest) 
p({t]d}A n jmk}\rest) 

p { ^ )T * r *, T{a/p + Tj) / r( 7 g) N A > : 
/i feF 7 TV r(a/p + r*.) ^r(7&y, 

maxfT.^T*) M ^ (14) 

tt Mil tt " im £njj + n jm-) 

max(Afj,AT*) 



n 



a- 1 



<l': 



n 



(e*) T - +tt/P T*.! r(a/P + T.j) ^ 



a/ 



LL ^T. s+a /p T.j! r(a/P + T*) 11 6* 4 



-i jmi 

"(17) 



B.3 Sampling 7 

We sample the HDP parameter 7 using reversible 
random walk Metropolis Hastings steps, giving an 
acceptance probability of 



min 1 



r(7*) 
r( 7 ) 



M M 

n 



T(n. m . + 7) 
\ r(n. m . + 7 *) 



i=l 



The probability of the processor assignments is 
given by: 



M 



p({wi7,£)= n 



T( 7 ) 



F 

m=i nj 
P 



TT r(7^ + »jm-) 

11 r( 7 0) : 
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